4: mutate(), case_when(), summarize(), across(), factor(), more ggplot2

BSTA 526: R Programming for Health Data Science

Author: Meike Niederhausen, PhD & Jessica Minnier, PhD

Affiliation: OHSU-PSU School of Public Health

Published: January 29, 2026

Modified: January 29, 2026

1 Welcome to R Programming: Part 4!

Today we will cover creating new variables and summarizing existing variables.


Before you get started:

Remember to save this notebook under a new name, such as part_04_b526_YOURNAME.qmd.


  • Load the packages & smoke_complete.xlsx data in the setup code chunk

1.1 Learning Objectives

  • Learn and apply mutate() to change the data type of a variable
  • Apply mutate() to calculate a new variable based on other variables in a data.frame.
  • Apply case_when in a mutate() statement to make a continuous variable categorical
  • Learn how to mutate() across() multiple columns at once.
  • Learn how to summarize() data with group_by() to summarize within categories
  • Learn how to summarize() data with multiple columns and functions at once, also with across().
  • Learn about the factor variable type and how factors differ from character vectors
  • Learn to change scales and palettes of ggplots.

2 Working with columns (variables)

2.1 rename() columns

  • A simple handy operation which does exactly what it sounds like.
  • The main thing you need to remember is:

rename(NEWNAME = OLDNAME)

names(smoke_complete)
 [1] "primary_diagnosis"           "tumor_stage"                
 [3] "age_at_diagnosis"            "vital_status"               
 [5] "morphology"                  "days_to_death"              
 [7] "state"                       "tissue_or_organ_of_origin"  
 [9] "days_to_birth"               "site_of_resection_or_biopsy"
[11] "days_to_last_follow_up"      "cigarettes_per_day"         
[13] "years_smoked"                "gender"                     
[15] "year_of_birth"               "race"                       
[17] "ethnicity"                   "year_of_death"              
[19] "bcr_patient_barcode"         "disease"                    
smoke_complete %>% 
  rename(STAGE = tumor_stage) %>%
  names()
 [1] "primary_diagnosis"           "STAGE"                      
 [3] "age_at_diagnosis"            "vital_status"               
 [5] "morphology"                  "days_to_death"              
 [7] "state"                       "tissue_or_organ_of_origin"  
 [9] "days_to_birth"               "site_of_resection_or_biopsy"
[11] "days_to_last_follow_up"      "cigarettes_per_day"         
[13] "years_smoked"                "gender"                     
[15] "year_of_birth"               "race"                       
[17] "ethnicity"                   "year_of_death"              
[19] "bcr_patient_barcode"         "disease"                    

2.2 mutate() adds columns to a dataset

  • mutate() creates new columns (variables), which are usually based on existing columns or are entirely new data
  • It does not change the data within the existing columns
    • (unless you overwrite them with mutate())


  • mutate() is similar to adding a formula in Excel to calculate the value of a new column based on previous columns.
  • You can do lots of things such as:
    • subtract one column from another
    • convert the units of one column to new units (such as days to years)
    • change the capitalization of categories in a variable
    • recode a continuous variable to be a categorical one

2.3 mutate() to calculate a new variable based on others

  • Note that we use = inside mutate, not == or <-:
    • we are not creating a logical operation,
    • nor are we assigning something to an object name.
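
To see why the distinction matters, here is a small sketch on a made-up tibble (toy and its days column are hypothetical, not part of smoke_complete). With =, mutate() stores the computed values under the name on the left. With ==, it evaluates a comparison and stores the TRUE/FALSE result in a column named after the expression.

```r
library(dplyr)

# toy data, invented for illustration only
toy <- tibble(days = c(365, 730, 1095))

# `=` names the new column and fills it with the computed values
with_equals <- toy %>% mutate(years = days / 365)

# `==` is a comparison: the result is a logical column,
# named after the expression itself
with_double_equals <- toy %>% mutate(days == 365)

names(with_equals)
names(with_double_equals)
```

The second result is rarely what we want, which is why = is the operator to use for naming new columns inside mutate().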

2.3.1 Example 1: calculate the sum of age_at_diagnosis and days_to_death to get the age_at_death.

smoke_complete %>% 
    mutate(age_at_death = age_at_diagnosis + days_to_death) %>%
  # check:  
  select(age_at_death, age_at_diagnosis, days_to_death) %>% head(5)
# A tibble: 5 × 3
  age_at_death age_at_diagnosis days_to_death
         <dbl>            <dbl>         <dbl>
1        24848            24477           371
2        26751            26615           136
3        30475            28171          2304
4           NA            27154            NA
5           NA            23370            NA


2.3.2 Example 2: convert age_at_diagnosis from age in days to age in years.

smoke_complete %>% 
    mutate(age_at_diagnosis_yr = age_at_diagnosis/365) %>%
  # check:  
  select(age_at_diagnosis, age_at_diagnosis_yr) %>% head(5)
# A tibble: 5 × 2
  age_at_diagnosis age_at_diagnosis_yr
             <dbl>               <dbl>
1            24477                67.1
2            26615                72.9
3            28171                77.2
4            27154                74.4
5            23370                64.0

2.4 Multiple mutate() & save the new variables

  • Remember, since we haven’t assigned this new “mutated” data frame to an object name, our work hasn’t been saved.
    • Below we save the new dataset as smoke_new. Look for it in the environment tab.
  • Notice also how we can mutate multiple variables at one time by separating the new variables with a ,.
smoke_new <- smoke_complete %>% 
    mutate(
      age_at_death = age_at_diagnosis + days_to_death,  #1st new var
      age_at_diagnosis_yr = age_at_diagnosis/365        #2nd new var
      )

glimpse(smoke_new)
Rows: 1,152
Columns: 22
$ primary_diagnosis           <chr> "C34.1", "C34.1", "C34.3", "C34.1", "C34.1…
$ tumor_stage                 <chr> "stage ia", "stage ib", "stage ib", "stage…
$ age_at_diagnosis            <dbl> 24477, 26615, 28171, 27154, 23370, 19025, …
$ vital_status                <chr> "dead", "dead", "dead", "alive", "alive", …
$ morphology                  <chr> "8070/3", "8070/3", "8070/3", "8083/3", "8…
$ days_to_death               <dbl> 371, 136, 2304, NA, NA, 345, 716, 2803, 97…
$ state                       <chr> "live", "live", "live", "live", "live", "l…
$ tissue_or_organ_of_origin   <chr> "C34.1", "C34.1", "C34.3", "C34.1", "C34.1…
$ days_to_birth               <dbl> -24477, -26615, -28171, -27154, -23370, -1…
$ site_of_resection_or_biopsy <chr> "C34.1", "C34.1", "C34.3", "C34.1", "C34.1…
$ days_to_last_follow_up      <dbl> NA, NA, 2099, 3747, 3576, NA, NA, 1810, 95…
$ cigarettes_per_day          <dbl> 10.9589041, 2.1917808, 1.6438356, 1.095890…
$ years_smoked                <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 26, NA…
$ gender                      <chr> "male", "male", "female", "male", "female"…
$ year_of_birth               <dbl> 1936, 1931, 1927, 1930, 1942, 1953, 1932, …
$ race                        <chr> "white", "asian", "white", "white", "not r…
$ ethnicity                   <chr> "not hispanic or latino", "not hispanic or…
$ year_of_death               <dbl> 2004, 2003, NA, NA, NA, 2005, 2006, NA, NA…
$ bcr_patient_barcode         <chr> "TCGA-18-3406", "TCGA-18-3407", "TCGA-18-3…
$ disease                     <chr> "LUSC", "LUSC", "LUSC", "LUSC", "LUSC", "L…
$ age_at_death                <dbl> 24848, 26751, 30475, NA, NA, 19370, 27654,…
$ age_at_diagnosis_yr         <dbl> 67.06027, 72.91781, 77.18082, 74.39452, 64…

2.4.1 Check

  • We can check the new variable age_at_diagnosis_yr by plotting it with age_at_diagnosis in a scatterplot,
    • seeing that they are indeed a simple linear transformation of each other:
ggplot(smoke_new) +
  aes(x = age_at_diagnosis, 
      y = age_at_diagnosis_yr) +
  geom_point()
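
A numeric check can complement the plot. The sketch below is one possible approach, not part of the original notes: it uses a small tibble (check_df is a hypothetical name) holding the first five age_at_diagnosis values printed earlier, and confirms that converting back from years recovers the original days.

```r
library(dplyr)

# first five ages at diagnosis (in days), copied from the output shown earlier
check_df <- tibble(age_at_diagnosis = c(24477, 26615, 28171, 27154, 23370)) %>%
  mutate(age_at_diagnosis_yr = age_at_diagnosis / 365)

# multiplying back by 365 should recover the original values
back_to_days_ok <- all.equal(check_df$age_at_diagnosis_yr * 365,
                             check_df$age_at_diagnosis)
back_to_days_ok

# a perfect linear relationship has a correlation of 1
age_cor <- cor(check_df$age_at_diagnosis, check_df$age_at_diagnosis_yr)
age_cor
```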

2.5 Challenge 1 (~10 mins)

  1. Discuss: Suppose we wanted to convert vital_status to a binary 0/1 variable. What is the difference between the two variables created in the code below, and why does each version work? Interpret the results from the tabyl() function (from the janitor package).

  2. Create a variable called cigarettes_total by multiplying cigarettes_per_day by -days_to_birth below.

  3. Create a binary variable cigarettes_high that denotes whether total number of cigarettes is greater than the mean of cigarettes_total.

# change eval = true

# 1. Discuss the code below:
smoke_new <- smoke_complete %>%
  mutate(
    alive = (vital_status == "alive"),
    alive2 = 1*(vital_status == "alive")
  )

smoke_new %>% glimpse()


# What does this code tell us?
smoke_new %>% 
  tabyl(alive, alive2)

# What does this code tell us?
smoke_new %>% 
  tabyl(vital_status, alive2)

#-----------------------------------------------
# 2. Calculate total number of cigarettes so far
smoke_new <- smoke_new %>%
  mutate(cigarettes_total = )


#-----------------------------------------------
# 3. Create a binary variable that denotes whether total # cigs is greater than the mean cigarettes_total
smoke_new <- smoke_new %>%
  mutate(cigarettes_high = )

# check this makes sense
smoke_new %>% tabyl(cigarettes_high)

3 factor variables

3.1 factor variables

  • factors are how R represents categorical data.
  • For the most part, you can use character and factor variables interchangeably for categorical data.
  • The main differences are that factors
    • define the permissible values (categories) for a variable with the levels argument
    • define the order in which these values are displayed.
character_vector <- c("Dog", "Dog", "Cat", "Mouse")
table(character_vector)
character_vector
  Cat   Dog Mouse 
    1     2     1 
  • Use the factor() function to convert a character vector into a factor vector
    • Make sure to include the argument called levels to define the permissible values
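
As a minimal sketch of what "permissible values" means, the example below adds a made-up value ("Bird") that is not listed in levels. The names pets and pets_fac are hypothetical and used only for illustration.

```r
# toy vector like the one above, plus one value we will NOT list as a level
pets <- c("Dog", "Dog", "Cat", "Mouse", "Bird")

# only the listed levels are permitted; "Bird" becomes NA
pets_fac <- factor(pets, levels = c("Dog", "Cat", "Mouse"))

pets_fac

# count the missing values created by the conversion
sum(is.na(pets_fac))
```

This is a common source of unexpected missing values, so it is worth tabulating a variable before and after converting it to a factor.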

Note: we highly recommend this short video on factors by Prof Kelly Bodwin.

3.2 levels of a factor

  • The levels of a factor are the permissible values in a factor.
  • The order of the levels controls the order in which the values appear in tables and on the axes in a plot.
  • You can find the levels of a factor variable with the levels() function. table() also lists the categories in level order, as does the tabyl() function from the janitor package.
factor_vector <- factor(character_vector)

# what are the levels of this vector?
levels(factor_vector)
[1] "Cat"   "Dog"   "Mouse"
# specify our own ordering of levels
factor_vector2 <- factor(character_vector, 
                        levels = c("Dog", "Cat", "Mouse"))

levels(factor_vector2)
[1] "Dog"   "Cat"   "Mouse"
table(factor_vector)
factor_vector
  Cat   Dog Mouse 
    1     2     1 
table(factor_vector2)
factor_vector2
  Dog   Cat Mouse 
    2     1     1 

Being able to order levels is the main reason to use factors, at least in plotting and doing counts.
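
The effect of level order can also be seen when sorting. The sketch below reuses the toy vectors from above; sorted_chr and sorted_fac are hypothetical names introduced here.

```r
character_vector <- c("Dog", "Dog", "Cat", "Mouse")
factor_vector2 <- factor(character_vector,
                         levels = c("Dog", "Cat", "Mouse"))

# character vectors sort alphabetically
sorted_chr <- sort(character_vector)

# factors sort by the order of their levels
sorted_fac <- sort(factor_vector2)

sorted_chr
as.character(sorted_fac)

# each value is stored as an integer position within levels()
as.integer(factor_vector2)
```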

3.3 Mini-challenge 1

  1. Change the order of factor_vector to be “Mouse”, “Cat”, “Dog”.
  2. Verify that you did it correctly by calling tabyl().
factor_vector <- factor(character_vector, 
                        levels = c("Dog", "Cat", "Mouse"))

3.4 Create factor variables using mutate

  • We can also use mutate() to make a character variable a factor.
  • Let’s convert gender from character into factor.
  • Then pipe the output into glimpse() so we can see the variable types.

Note: for a character vector, factor() with no other arguments gives the same result as as.factor().

smoke_complete %>% 
    #reassign the gender variable to be a factor
    mutate(gender = factor(gender)) %>%
    glimpse()
Rows: 1,152
Columns: 20
$ primary_diagnosis           <chr> "C34.1", "C34.1", "C34.3", "C34.1", "C34.1…
$ tumor_stage                 <chr> "stage ia", "stage ib", "stage ib", "stage…
$ age_at_diagnosis            <dbl> 24477, 26615, 28171, 27154, 23370, 19025, …
$ vital_status                <chr> "dead", "dead", "dead", "alive", "alive", …
$ morphology                  <chr> "8070/3", "8070/3", "8070/3", "8083/3", "8…
$ days_to_death               <dbl> 371, 136, 2304, NA, NA, 345, 716, 2803, 97…
$ state                       <chr> "live", "live", "live", "live", "live", "l…
$ tissue_or_organ_of_origin   <chr> "C34.1", "C34.1", "C34.3", "C34.1", "C34.1…
$ days_to_birth               <dbl> -24477, -26615, -28171, -27154, -23370, -1…
$ site_of_resection_or_biopsy <chr> "C34.1", "C34.1", "C34.3", "C34.1", "C34.1…
$ days_to_last_follow_up      <dbl> NA, NA, 2099, 3747, 3576, NA, NA, 1810, 95…
$ cigarettes_per_day          <dbl> 10.9589041, 2.1917808, 1.6438356, 1.095890…
$ years_smoked                <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 26, NA…
$ gender                      <fct> male, male, female, male, female, male, ma…
$ year_of_birth               <dbl> 1936, 1931, 1927, 1930, 1942, 1953, 1932, …
$ race                        <chr> "white", "asian", "white", "white", "not r…
$ ethnicity                   <chr> "not hispanic or latino", "not hispanic or…
$ year_of_death               <dbl> 2004, 2003, NA, NA, NA, 2005, 2006, NA, NA…
$ bcr_patient_barcode         <chr> "TCGA-18-3406", "TCGA-18-3407", "TCGA-18-3…
$ disease                     <chr> "LUSC", "LUSC", "LUSC", "LUSC", "LUSC", "L…
  • One thing to notice: we are doing something called reassignment here.
    • We’re taking the previous values of our variable (gender),
    • doing something to it (making it a factor), and
    • then reassigning the variable gender to our fixed set of values.

Change the order of the levels to first female and then male.

  • Assign the levels order using the levels argument in factor().
  • Show the order of the levels by piping the output into tabyl()
smoke_complete %>% 
    #reassign the gender variable to be a factor
    mutate(gender = factor(gender, levels = c("female", "male"))) %>%
    tabyl(gender)
 gender   n   percent
 female 366 0.3177083
   male 786 0.6822917

Notice that the female value is before the male, which is what we wanted.

3.5 ggplot2 & factor variables

  • Remember, vital_status is a character vector
  • If we do not care about the order of the categories, we can use it as is when creating figures.
# create our data, save as a new df smoke_new
smoke_new <- smoke_complete %>%
  mutate(
    alive = (vital_status == "alive"),
    alive2 = 1*(vital_status == "alive"),
    cigarettes_total = -days_to_birth*cigarettes_per_day,
    cigarettes_high = 1*(cigarettes_total > mean(cigarettes_total, na.rm = TRUE)),
    gender = factor(gender, levels = c("female", "male"))
  )
  • Create boxplot of cigarettes_total stratified by vital_status (character variable):
ggplot(smoke_new) +
  aes(x = vital_status, 
      y = cigarettes_total,
      fill = vital_status) +
  geom_boxplot()

3.6 Add factor levels before plotting

  • Perhaps we want to change the order of the vital_status values in our plot.
  • This is where factor levels are useful.
smoke_new <- smoke_new %>% 
  mutate(
    vital_status = factor(vital_status, 
                          levels = c("dead", "alive")))

ggplot(smoke_new) +
  aes(x = vital_status, 
      y = cigarettes_total, 
      fill = vital_status) +
  geom_boxplot()

Let’s look at a variable with more than 2 categories:

smoke_new <- smoke_new %>% 
  mutate(
    disease_fac = factor(disease, 
                         levels = c("LUSC", "BLCA", "CESC")))

# original disease categories, a character vector
ggplot(smoke_new) +
  aes(x = disease, 
      y = cigarettes_total, 
      fill = disease) +
  geom_boxplot()

# with factor levels changed
ggplot(smoke_new) +
  aes(x = disease_fac, 
      y = cigarettes_total, 
      fill = disease_fac) +
  geom_boxplot()

4 replace_na


4.1 Replace missing values: use mutate with replace_na

  • Sometimes we want to fill in missing values with a certain value.
  • Use the replace_na() function inside of mutate() to specify this.


4.1.1 Example

  • For example, if the days to last follow up is missing, we want to set their days to follow up to 0.
    • The first two observations in these data have missing follow up, but when we replace them we note they are now equal to 0:
summary(smoke_new$days_to_last_follow_up)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  -64.0   296.0   641.0   944.9  1224.0  6375.0     271 
smoke_new %>% select(days_to_last_follow_up) %>% head()
# A tibble: 6 × 1
  days_to_last_follow_up
                   <dbl>
1                     NA
2                     NA
3                   2099
4                   3747
5                   3576
6                     NA
smoke_new <- smoke_new %>%
    mutate(
      days_to_last_follow_up2 = replace_na(days_to_last_follow_up, 0)
      )

# select only these columns for printing; the new column was already saved in smoke_new above
smoke_new %>% 
  select(contains("follow_up")) %>% 
  head()
# A tibble: 6 × 2
  days_to_last_follow_up days_to_last_follow_up2
                   <dbl>                   <dbl>
1                     NA                       0
2                     NA                       0
3                   2099                    2099
4                   3747                    3747
5                   3576                    3576
6                     NA                       0
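
replace_na() can also be applied to a whole data frame, with a named list giving one replacement per column. The sketch below uses made-up data (toy and its columns are hypothetical); the replacement must have the same type as the column it fills.

```r
library(dplyr)
library(tidyr)

# toy data, invented for illustration only
toy <- tibble(
  follow_up = c(NA, 2099, NA),
  race      = c("white", NA, "asian")
)

# a named list: numeric replacement for the numeric column,
# character replacement for the character column
toy_filled <- toy %>%
  replace_na(list(follow_up = 0, race = "not reported"))

toy_filled
```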

5 case_when()

Create new variables based on multiple conditions

5.1 Use case_when() within mutate to make a continuous variable categorical

  • Say we want to make the cigarettes_per_day into a categorical variable with the values:
    • 0-5 cigarettes/day
    • 6+ cigarettes/day
  • How would we do that?
    • We saw one way to create a binary variable,
      • i.e. 1*(cigarettes_per_day >= 6),
    • but here we want to use case_when(),
      • which, as you will see, is much more flexible.


With case_when(), we need to follow this basic pattern for each of our categories:

condition ~ category name

  • The left side of the ~ is the condition: a TRUE/FALSE statement, usually written in terms of our column names, that defines who belongs in the category.
  • The right side of the ~ is where we specify the category name (as a character).


  • In the example below,
    • cigarettes_per_day < 6 is our left side, and
    • 0-5 is our right side (our category name).
  • We need to do this for each level in our category.
# note this work is not saved, no assignment operator!
smoke_new %>%
  # case_when() is used *inside* mutate()
  mutate(
    cigarettes_category = case_when(        # new column name to left of =
      # TRUE/FALSE statement to left of ~
      cigarettes_per_day < 6 ~ "0-5",       # don't forget a comma between cases
      cigarettes_per_day >= 6 ~ "6+"
    )
  ) %>%
  mutate(cigarettes_category = factor(cigarettes_category)) %>%
  tabyl(cigarettes_category)
 cigarettes_category    n    percent
                 0-5 1100 0.95486111
                  6+   52 0.04513889
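
One detail worth checking is what happens to missing values. The sketch below applies the same two cases to a made-up tibble (toy is hypothetical) that includes an NA. A missing value satisfies neither condition, so it stays missing in the new variable.

```r
library(dplyr)

# toy data with one missing value, invented for illustration only
toy <- tibble(cigarettes_per_day = c(1.5, 5.9, 6, 12, NA))

toy_cat <- toy %>%
  mutate(
    cigarettes_category = case_when(
      cigarettes_per_day < 6  ~ "0-5",
      cigarettes_per_day >= 6 ~ "6+"
      # NA < 6 and NA >= 6 are both NA, so no case matches
      # and the new value is NA
    )
  )

toy_cat
```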

5.2 Mini-challenge 2 (~5 minutes)

  • Modify the code below to recode cigarettes_category to have 3 levels:
    • 0-2
    • 3-5 (everything bigger than 2 and less than 6)
    • 6+

Hint: you’ll have to chain conditions with an & to get the 3-5 category.

smoke_new %>%
  mutate(
    cigarettes_category = case_when(
      cigarettes_per_day <= 2 ~ "0-2",
      
      ------ ~ ------, # fill this in
      
      cigarettes_per_day >= 6 ~ "6+"
    )
  ) %>%
  mutate(cigarettes_category = factor(cigarettes_category)) %>%
  tabyl(cigarettes_category)

5.3 case_when with character vectors

Goal: collapse the tumor stages into three categories: stage i, stage ii, and everything else.

smoke_new %>% tabyl(tumor_stage)
  tumor_stage   n     percent
 not reported  99 0.085937500
      stage i   7 0.006076389
     stage ia 146 0.126736111
     stage ib 266 0.230902778
     stage ii  65 0.056423611
    stage iia 112 0.097222222
    stage iib 148 0.128472222
    stage iii  86 0.074652778
   stage iiia 102 0.088541667
   stage iiib  30 0.026041667
     stage iv  91 0.078993056

Some important points in this example:

  • We could use many “|” statements (“or” statements),
    • or we could define multiple cases for the same category.
  • This works because of how case_when works:
    • if THIS, then category.
    • else if THIS, then category.
    • else if THIS, then category…
    • and on and on
  • The .default at the end is the value assigned when all other conditions return FALSE or NA,
    • i.e. “else default to this”.
    • If left blank or not specified then a missing value will be used.
smoke_new <- smoke_new %>%
  mutate(
    # new column name = case_when()
    stage_category = case_when(
      # group these tumor stages together as "i"
      tumor_stage == "stage i" ~ "i",
      tumor_stage == "stage ia" ~ "i",
      tumor_stage == "stage ib" ~ "i",
      # group these together as "ii"
      tumor_stage == "stage ii" ~ "ii",
      tumor_stage == "stage iia" ~ "ii",
      tumor_stage == "stage iib" ~ "ii",
      # else if, then categorize as "other"
      .default = "other"
    )
    )

smoke_new %>% tabyl(tumor_stage, stage_category)
  tumor_stage   i  ii other
 not reported   0   0    99
      stage i   7   0     0
     stage ia 146   0     0
     stage ib 266   0     0
     stage ii   0  65     0
    stage iia   0 112     0
    stage iib   0 148     0
    stage iii   0   0    86
   stage iiia   0   0   102
   stage iiib   0   0    30
     stage iv   0   0    91

Using |

smoke_new <- smoke_new %>%
  mutate(
    # new column name = case_when()
    stage_category = case_when(
      # group these tumor stages together as "i"
      tumor_stage == "stage i" | 
        tumor_stage == "stage ia" | 
        tumor_stage == "stage ib" ~ "i",
      # group these together as "ii"
      tumor_stage == "stage ii" |
        tumor_stage == "stage iia" |
        tumor_stage == "stage iib" ~ "ii",
      # else if, then categorize as "other"
      .default = "other"
    )
    )

smoke_new %>% tabyl(tumor_stage, stage_category)
  tumor_stage   i  ii other
 not reported   0   0    99
      stage i   7   0     0
     stage ia 146   0     0
     stage ib 266   0     0
     stage ii   0  65     0
    stage iia   0 112     0
    stage iib   0 148     0
    stage iii   0   0    86
   stage iiia   0   0   102
   stage iiib   0   0    30
     stage iv   0   0    91

Using %in%

smoke_new <- smoke_new %>%
  mutate(
    # new column name = case_when()
    stage_category = case_when(
      # group these tumor stages together as "i"
      tumor_stage %in% c("stage i", "stage ia", "stage ib") ~ "i",
      # group these together as "ii"
      tumor_stage %in% c("stage ii", "stage iia", "stage iib") ~ "ii",
      # else if, then categorize as "other"
      .default = "other"
    )
    )

smoke_new %>% tabyl(tumor_stage, stage_category)
  tumor_stage   i  ii other
 not reported   0   0    99
      stage i   7   0     0
     stage ia 146   0     0
     stage ib 266   0     0
     stage ii   0  65     0
    stage iia   0 112     0
    stage iib   0 148     0
    stage iii   0   0    86
   stage iiia   0   0   102
   stage iiib   0   0    30
     stage iv   0   0    91
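
The | and %in% versions treat missing values slightly differently along the way, even though they end in the same place here. The sketch below uses a made-up vector (stages is hypothetical) and assumes dplyr 1.1.0 or later, where case_when() has the .default argument.

```r
library(dplyr)

# toy stages including a missing value, invented for illustration only
stages <- c("stage i", "stage ib", "stage iia", "stage iv", NA)

# `==` returns NA for a missing value, while `%in%` returns FALSE
stages == "stage i"
stages %in% c("stage i")

# in both versions the missing value matches no case and falls to .default
by_equals <- case_when(
  stages == "stage i" | stages == "stage ia" | stages == "stage ib" ~ "i",
  stages == "stage ii" | stages == "stage iia" | stages == "stage iib" ~ "ii",
  .default = "other"
)

by_in <- case_when(
  stages %in% c("stage i", "stage ia", "stage ib") ~ "i",
  stages %in% c("stage ii", "stage iia", "stage iib") ~ "ii",
  .default = "other"
)

identical(by_equals, by_in)
```

If missing stages should stay missing rather than become "other", they would need their own case, for example is.na(stages) ~ NA_character_ placed before .default.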

6 mutate() across multiple columns with across()

6.1 mutate() across multiple columns with across()

  • Often we may want to mutate multiple variables in the same way.
  • One option is to follow the examples above and specify each operation individually.


  • For example, suppose we want to convert the values of some character columns into “title case”.
    • There’s a function in the stringr package that does this:
myvec <- c("people", "thing")
stringr::str_to_title(myvec)
[1] "People" "Thing" 
  • We can apply this to multiple columns one at a time:
smoke_complete %>%
    mutate(tumor_stage = str_to_title(tumor_stage),
           vital_status = str_to_title(vital_status),
           gender = str_to_title(gender),
           race = str_to_title(race)) %>%
  glimpse()
Rows: 1,152
Columns: 20
$ primary_diagnosis           <chr> "C34.1", "C34.1", "C34.3", "C34.1", "C34.1…
$ tumor_stage                 <chr> "Stage Ia", "Stage Ib", "Stage Ib", "Stage…
$ age_at_diagnosis            <dbl> 24477, 26615, 28171, 27154, 23370, 19025, …
$ vital_status                <chr> "Dead", "Dead", "Dead", "Alive", "Alive", …
$ morphology                  <chr> "8070/3", "8070/3", "8070/3", "8083/3", "8…
$ days_to_death               <dbl> 371, 136, 2304, NA, NA, 345, 716, 2803, 97…
$ state                       <chr> "live", "live", "live", "live", "live", "l…
$ tissue_or_organ_of_origin   <chr> "C34.1", "C34.1", "C34.3", "C34.1", "C34.1…
$ days_to_birth               <dbl> -24477, -26615, -28171, -27154, -23370, -1…
$ site_of_resection_or_biopsy <chr> "C34.1", "C34.1", "C34.3", "C34.1", "C34.1…
$ days_to_last_follow_up      <dbl> NA, NA, 2099, 3747, 3576, NA, NA, 1810, 95…
$ cigarettes_per_day          <dbl> 10.9589041, 2.1917808, 1.6438356, 1.095890…
$ years_smoked                <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 26, NA…
$ gender                      <chr> "Male", "Male", "Female", "Male", "Female"…
$ year_of_birth               <dbl> 1936, 1931, 1927, 1930, 1942, 1953, 1932, …
$ race                        <chr> "White", "Asian", "White", "White", "Not R…
$ ethnicity                   <chr> "not hispanic or latino", "not hispanic or…
$ year_of_death               <dbl> 2004, 2003, NA, NA, NA, 2005, 2006, NA, NA…
$ bcr_patient_barcode         <chr> "TCGA-18-3406", "TCGA-18-3407", "TCGA-18-3…
$ disease                     <chr> "LUSC", "LUSC", "LUSC", "LUSC", "LUSC", "L…
  • Phew, that’s a lot of typing! But we can use across() to automate this.
smoke_complete %>%
    mutate(
      across(.cols = c(tumor_stage, vital_status, gender, race),
             .fns = str_to_title)
      ) %>%
  glimpse()
Rows: 1,152
Columns: 20
$ primary_diagnosis           <chr> "C34.1", "C34.1", "C34.3", "C34.1", "C34.1…
$ tumor_stage                 <chr> "Stage Ia", "Stage Ib", "Stage Ib", "Stage…
$ age_at_diagnosis            <dbl> 24477, 26615, 28171, 27154, 23370, 19025, …
$ vital_status                <chr> "Dead", "Dead", "Dead", "Alive", "Alive", …
$ morphology                  <chr> "8070/3", "8070/3", "8070/3", "8083/3", "8…
$ days_to_death               <dbl> 371, 136, 2304, NA, NA, 345, 716, 2803, 97…
$ state                       <chr> "live", "live", "live", "live", "live", "l…
$ tissue_or_organ_of_origin   <chr> "C34.1", "C34.1", "C34.3", "C34.1", "C34.1…
$ days_to_birth               <dbl> -24477, -26615, -28171, -27154, -23370, -1…
$ site_of_resection_or_biopsy <chr> "C34.1", "C34.1", "C34.3", "C34.1", "C34.1…
$ days_to_last_follow_up      <dbl> NA, NA, 2099, 3747, 3576, NA, NA, 1810, 95…
$ cigarettes_per_day          <dbl> 10.9589041, 2.1917808, 1.6438356, 1.095890…
$ years_smoked                <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 26, NA…
$ gender                      <chr> "Male", "Male", "Female", "Male", "Female"…
$ year_of_birth               <dbl> 1936, 1931, 1927, 1930, 1942, 1953, 1932, …
$ race                        <chr> "White", "Asian", "White", "White", "Not R…
$ ethnicity                   <chr> "not hispanic or latino", "not hispanic or…
$ year_of_death               <dbl> 2004, 2003, NA, NA, NA, 2005, 2006, NA, NA…
$ bcr_patient_barcode         <chr> "TCGA-18-3406", "TCGA-18-3407", "TCGA-18-3…
$ disease                     <chr> "LUSC", "LUSC", "LUSC", "LUSC", "LUSC", "L…

6.2 How do we use across()?

  • The above example uses the across() function inside the mutate() function.
    • Note that mutate() wraps fully around across() so it’s a nested function call on the data:

data %>% mutate(across(____))

  • This is an advanced way of using mutate() and also summarize(), as we will see later today.


  • across() has two important arguments
    • .cols selects the columns you want to operate on.
      • It uses tidyselect (like select()), so you can pick variables by position, name, and type
    • .fns is the function or list of functions to apply to each column.
      • This can also be a purrr style formula (or list of formulas) like ~ .x/2.
        • This will take some practice. We will see this again when we go over purrr in later classes.
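
As a minimal sketch of the two arguments, the example below selects columns in two different ways and writes the function in two different ways. The tibble toy and its columns are made up for illustration.

```r
library(dplyr)

# toy data, invented for illustration only
toy <- tibble(
  id            = c("a", "b"),
  days_to_death = c(365, 730),
  days_to_birth = c(-3650, -7300)
)

# .cols by type: where(is.numeric) picks every numeric column
# .fns as a purrr-style formula, with .x standing in for each column
by_type <- toy %>%
  mutate(across(.cols = where(is.numeric), .fns = ~ .x / 365))

# .cols by name pattern: starts_with("days")
# .fns as an ordinary anonymous function
by_name <- toy %>%
  mutate(across(.cols = starts_with("days"), .fns = function(x) x / 365))

# both approaches give the same result on this toy data
identical(by_type, by_name)
```

Selecting by type is convenient but broad: it would also transform numeric columns that are not measured in days, so selecting by name is often the safer choice.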


6.3 across() examples

  • Suppose we want to convert all the days variables to years.
  • One column at a time looks like this:
smoke_complete %>%
    mutate(
      years_to_death = days_to_death/365,
      years_to_birth = days_to_birth/365,
      years_to_last_follow_up = days_to_last_follow_up/365
    ) %>% 
  # check
  select(contains("_to_"))
# A tibble: 1,152 × 6
   days_to_death days_to_birth days_to_last_follow_up years_to_death
           <dbl>         <dbl>                  <dbl>          <dbl>
 1           371        -24477                     NA          1.02 
 2           136        -26615                     NA          0.373
 3          2304        -28171                   2099          6.31 
 4            NA        -27154                   3747         NA    
 5            NA        -23370                   3576         NA    
 6           345        -19025                     NA          0.945
 7           716        -26938                     NA          1.96 
 8          2803        -28430                   1810          7.68 
 9           973        -30435                    956          2.67 
10          1097        -24019                    758          3.01 
# ℹ 1,142 more rows
# ℹ 2 more variables: years_to_birth <dbl>, years_to_last_follow_up <dbl>
  • Below is the across() automated way,
    • using our own math/operation a.k.a “function”.
  • across()
    • first asks which variables to use (.cols =)
    • then asks which operation(s) (.fns =).
  • Below we create our own function ~ .x/365
    • We need the ~ when we define our own operation or function.
      • This tells R, function starts here!
    • The .x is a placeholder for the column(s) input,
      • think of it as algebra with the math operation: column/365.
smoke_complete %>%
    mutate(
        across(.cols = c(days_to_death, 
                         days_to_birth, 
                         days_to_last_follow_up), 
               .fns = ~ .x/365)) %>% 
  # check
  select(contains("_to_"))
# A tibble: 1,152 × 3
   days_to_death days_to_birth days_to_last_follow_up
           <dbl>         <dbl>                  <dbl>
 1         1.02          -67.1                  NA   
 2         0.373         -72.9                  NA   
 3         6.31          -77.2                   5.75
 4        NA             -74.4                  10.3 
 5        NA             -64.0                   9.80
 6         0.945         -52.1                  NA   
 7         1.96          -73.8                  NA   
 8         7.68          -77.9                   4.96
 9         2.67          -83.4                   2.62
10         3.01          -65.8                   2.08
# ℹ 1,142 more rows
  • In the across() example above, we mutated the variables “in place”,
    • that is, we overwrote the columns with the values in “years”
    • but did not change the names!
  • This can be confusing.
  • There are different ways to fix this.
  • Right now, we will use a simple method,
    • with the rename() function,
    • which changes the column names but leaves the values untouched.
smoke_days <- smoke_complete %>%
    mutate(
        across(.cols = c(days_to_death, days_to_birth, days_to_last_follow_up), 
               .fns = ~ .x/365)) %>%
  rename(
    years_to_death = days_to_death,
    years_to_birth = days_to_birth,
    years_to_last_follow_up = days_to_last_follow_up
  ) 
# check
smoke_days %>% select(contains("_to_"))
# A tibble: 1,152 × 3
   years_to_death years_to_birth years_to_last_follow_up
            <dbl>          <dbl>                   <dbl>
 1          1.02           -67.1                   NA   
 2          0.373          -72.9                   NA   
 3          6.31           -77.2                    5.75
 4         NA              -74.4                   10.3 
 5         NA              -64.0                    9.80
 6          0.945          -52.1                   NA   
 7          1.96           -73.8                   NA   
 8          7.68           -77.9                    4.96
 9          2.67           -83.4                    2.62
10          3.01           -65.8                    2.08
# ℹ 1,142 more rows
  • As an advanced preview, once we learn about
    • glue to manipulate strings, and
    • stringr package functions such as str_replace
  • we can change the names using the .names= argument inside across() with these methods:
smoke_days <- smoke_complete %>%
    mutate(
        across(.cols = c(days_to_death, days_to_birth, days_to_last_follow_up), 
               .fns = ~ .x/365,
               # replace "days" with "years" in each new column name
               .names = "{str_replace(.col, 'days', 'years')}")) 

# .names creates new columns, so the original days_to_* columns are kept as well
smoke_days %>% select(starts_with("years_to"))
# A tibble: 1,152 × 3
   years_to_death years_to_birth years_to_last_follow_up
            <dbl>          <dbl>                   <dbl>
 1          1.02           -67.1                   NA   
 2          0.373          -72.9                   NA   
 3          6.31           -77.2                    5.75
 4         NA              -74.4                   10.3 
 5         NA              -64.0                    9.80
 6          0.945          -52.1                   NA   
 7          1.96           -73.8                   NA   
 8          7.68           -77.9                    4.96
 9          2.67           -83.4                    2.62
10          3.01           -65.8                    2.08
# ℹ 1,142 more rows
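
To see the .names= argument in isolation, here is a minimal sketch on a small made-up tibble (toy and its columns are invented for illustration and are not part of smoke_complete). It assumes only that dplyr is loaded. The {.col} placeholder stands for each selected column's name.

```r
library(dplyr)

# Made-up data for illustration only
toy <- tibble(
  days_to_event = c(365, 730, NA),
  days_to_visit = c(73, 146, 219)
)

toy_years <- toy %>%
  mutate(
    across(.cols = starts_with("days"),
           .fns = ~ .x / 365,
           # {.col} is replaced by the name of each selected column
           .names = "{.col}_in_years"))

# the original columns are kept and the new ones are added after them
names(toy_years)
```

Because the new names differ from the old ones, nothing is overwritten, which avoids the "mutated in place" confusion described above.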
  • What if we had many, many “days_to” columns?
    • That still is a lot of typing…
  • We can actually use across() in a more powerful way with tidyselect.

First, remember how we can use select() with the tidyselect syntax, such as:

smoke_complete %>% select(starts_with("days"))
# A tibble: 1,152 × 3
   days_to_death days_to_birth days_to_last_follow_up
           <dbl>         <dbl>                  <dbl>
 1           371        -24477                     NA
 2           136        -26615                     NA
 3          2304        -28171                   2099
 4            NA        -27154                   3747
 5            NA        -23370                   3576
 6           345        -19025                     NA
 7           716        -26938                     NA
 8          2803        -28430                   1810
 9           973        -30435                    956
10          1097        -24019                    758
# ℹ 1,142 more rows

See more examples here.

  • We can use tidyselect to choose our variables within across(),
    • for instance, we could call the same variables by using starts_with("days").
smoke_days <- smoke_complete %>%
    mutate(
        across(.cols = starts_with("days"), # now we are using tidyselect here
               .fns = ~ .x/365)) %>%
  rename(
    years_to_death = days_to_death,
    years_to_birth = days_to_birth,
    years_to_last_follow_up = days_to_last_follow_up
  ) 

smoke_days %>% select(contains("_to_"))
# A tibble: 1,152 × 3
   years_to_death years_to_birth years_to_last_follow_up
            <dbl>          <dbl>                   <dbl>
 1          1.02           -67.1                   NA   
 2          0.373          -72.9                   NA   
 3          6.31           -77.2                    5.75
 4         NA              -74.4                   10.3 
 5         NA              -64.0                    9.80
 6          0.945          -52.1                   NA   
 7          1.96           -73.8                   NA   
 8          7.68           -77.9                    4.96
 9          2.67           -83.4                    2.62
10          3.01           -65.8                    2.08
# ℹ 1,142 more rows
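
The rename() step above still lists every column by hand. One possible alternative, not used in the original notes, is dplyr's rename_with(), which applies a function to the selected column names. The sketch below uses a made-up tibble (toy is hypothetical) and assumes dplyr and stringr are loaded.

```r
library(dplyr)
library(stringr)

# Made-up data for illustration only
toy <- tibble(
  days_to_event = c(365, 730),
  days_to_visit = c(73, 146),
  group         = c("a", "b")
)

toy_years <- toy %>%
  # convert every column that starts with "days" from days to years
  mutate(across(.cols = starts_with("days"), .fns = ~ .x / 365)) %>%
  # then rename those same columns, replacing "days" with "years"
  rename_with(.fn = ~ str_replace(.x, "days", "years"),
              .cols = starts_with("days"))

names(toy_years)
```

The same tidyselect helper is used in both steps, so adding more "days_to" columns would not require any extra typing.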

6.4 Using across() takes some getting used to

  • Using across() takes some getting used to,
    • but seeing more examples will help you see how it works.
  • There are also more complex ways to use your own functions and naming conventions.


6.5 Using built-in functions with across()

  • We can also use built-in function names (instead of ~.x/365 for example)

Replace all missing numeric values with 0

  • Below we
    • select the numeric columns using where(is.numeric)
      • (where() is a tidyselect function that selects those for which the function returns TRUE), and
    • replace missing values with 0
      • (replace = is an argument of replace_na()).
smoke_complete %>%
  mutate(
    # replace is an argument of replace_na
    across(.cols = where(is.numeric), 
           .fns = ~ replace_na(.x, replace = 0))) %>%
  # check
  select(where(is.numeric)) %>% 
  summary()   # note no NAs!
 age_at_diagnosis days_to_death    days_to_birth    days_to_last_follow_up
 Min.   : 7855    Min.   :   0.0   Min.   :-32872   Min.   : -64.0        
 1st Qu.:22069    1st Qu.:   0.0   1st Qu.:-26927   1st Qu.:   7.5        
 Median :24750    Median :   0.0   Median :-24750   Median : 441.0        
 Mean   :24175    Mean   : 371.5   Mean   :-24175   Mean   : 722.6        
 3rd Qu.:26927    3rd Qu.: 461.8   3rd Qu.:-22069   3rd Qu.: 987.0        
 Max.   :32872    Max.   :5287.0   Max.   : -7855   Max.   :6375.0        
 cigarettes_per_day  years_smoked   year_of_birth  year_of_death   
 Min.   : 0.00822   Min.   : 0.00   Min.   :   0   Min.   :   0.0  
 1st Qu.: 1.36986   1st Qu.: 0.00   1st Qu.:1933   1st Qu.:   0.0  
 Median : 2.19178   Median : 0.00   Median :1940   Median :   0.0  
 Mean   : 2.60800   Mean   :14.56   Mean   :1883   Mean   : 616.6  
 3rd Qu.: 3.28767   3rd Qu.:35.00   3rd Qu.:1949   3rd Qu.:2004.0  
 Max.   :40.00000   Max.   :63.00   Max.   :1986   Max.   :2013.0  

Convert all character columns to title case

  • Use the where() function to select all columns that are a character type
  • is.character() returns
    • TRUE if the column is a character type, and
    • FALSE otherwise
smoke_complete %>%
    mutate(
      across(.cols = where(is.character),
             .fns = str_to_title)  # note no ~ and .x
      ) %>%
  # check
  select(where(is.character)) %>% 
  glimpse()
Rows: 1,152
Columns: 12
$ primary_diagnosis           <chr> "C34.1", "C34.1", "C34.3", "C34.1", "C34.1…
$ tumor_stage                 <chr> "Stage Ia", "Stage Ib", "Stage Ib", "Stage…
$ vital_status                <chr> "Dead", "Dead", "Dead", "Alive", "Alive", …
$ morphology                  <chr> "8070/3", "8070/3", "8070/3", "8083/3", "8…
$ state                       <chr> "Live", "Live", "Live", "Live", "Live", "L…
$ tissue_or_organ_of_origin   <chr> "C34.1", "C34.1", "C34.3", "C34.1", "C34.1…
$ site_of_resection_or_biopsy <chr> "C34.1", "C34.1", "C34.3", "C34.1", "C34.1…
$ gender                      <chr> "Male", "Male", "Female", "Male", "Female"…
$ race                        <chr> "White", "Asian", "White", "White", "Not R…
$ ethnicity                   <chr> "Not Hispanic Or Latino", "Not Hispanic Or…
$ bcr_patient_barcode         <chr> "Tcga-18-3406", "Tcga-18-3407", "Tcga-18-3…
$ disease                     <chr> "Lusc", "Lusc", "Lusc", "Lusc", "Lusc", "L…
  • In the above example, I didn’t need to specify any arguments for the str_to_title() function.
  • Thus I was able to simplify the .fns part of the code with just .fns = str_to_title
  • The long (purrr) way works as well
    • If you use the ~, you have to add the (.x)
smoke_complete %>%
    mutate(
      across(.cols = where(is.character),
             .fns = ~ str_to_title(.x))  # with ~ and .x
      ) %>%
  # check
  select(where(is.character)) %>% 
  glimpse()
Rows: 1,152
Columns: 12
$ primary_diagnosis           <chr> "C34.1", "C34.1", "C34.3", "C34.1", "C34.1…
$ tumor_stage                 <chr> "Stage Ia", "Stage Ib", "Stage Ib", "Stage…
$ vital_status                <chr> "Dead", "Dead", "Dead", "Alive", "Alive", …
$ morphology                  <chr> "8070/3", "8070/3", "8070/3", "8083/3", "8…
$ state                       <chr> "Live", "Live", "Live", "Live", "Live", "L…
$ tissue_or_organ_of_origin   <chr> "C34.1", "C34.1", "C34.3", "C34.1", "C34.1…
$ site_of_resection_or_biopsy <chr> "C34.1", "C34.1", "C34.3", "C34.1", "C34.1…
$ gender                      <chr> "Male", "Male", "Female", "Male", "Female"…
$ race                        <chr> "White", "Asian", "White", "White", "Not R…
$ ethnicity                   <chr> "Not Hispanic Or Latino", "Not Hispanic Or…
$ bcr_patient_barcode         <chr> "Tcga-18-3406", "Tcga-18-3407", "Tcga-18-3…
$ disease                     <chr> "Lusc", "Lusc", "Lusc", "Lusc", "Lusc", "L…
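
As a side-by-side comparison, the sketch below writes the same across() call three ways on a made-up tibble (toy is hypothetical): with a bare function name, with the ~ and .x shorthand, and with base R's \(x) anonymous function syntax. The third form is an addition to the notes and assumes R 4.1 or newer.

```r
library(dplyr)
library(stringr)

# Made-up data for illustration only
toy <- tibble(
  name = c("ada lovelace", "alan turing"),
  city = c("london", "manchester"),
  n    = c(1, 2)
)

# 1. bare function name: no extra arguments needed
by_name <- toy %>%
  mutate(across(where(is.character), str_to_title))

# 2. formula shorthand: ~ with .x standing for the column
by_tilde <- toy %>%
  mutate(across(where(is.character), ~ str_to_title(.x)))

# 3. base R anonymous function (assumes R >= 4.1)
by_lambda <- toy %>%
  mutate(across(where(is.character), \(x) str_to_title(x)))

# compare the results of the three versions
identical(by_name$name, by_tilde$name)
identical(by_name$name, by_lambda$name)
```

The numeric column n is left untouched in all three versions because where(is.character) does not select it.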

Convert all character columns to factors

smoke_complete %>%
    mutate(
      across(.cols = where(is.character), 
             .fns = as.factor)) %>%   # factor works as well
  # check  
  glimpse()
Rows: 1,152
Columns: 20
$ primary_diagnosis           <fct> C34.1, C34.1, C34.3, C34.1, C34.1, C34.3, …
$ tumor_stage                 <fct> stage ia, stage ib, stage ib, stage ia, st…
$ age_at_diagnosis            <dbl> 24477, 26615, 28171, 27154, 23370, 19025, …
$ vital_status                <fct> dead, dead, dead, alive, alive, dead, dead…
$ morphology                  <fct> 8070/3, 8070/3, 8070/3, 8083/3, 8070/3, 80…
$ days_to_death               <dbl> 371, 136, 2304, NA, NA, 345, 716, 2803, 97…
$ state                       <fct> live, live, live, live, live, live, live, …
$ tissue_or_organ_of_origin   <fct> C34.1, C34.1, C34.3, C34.1, C34.1, C34.3, …
$ days_to_birth               <dbl> -24477, -26615, -28171, -27154, -23370, -1…
$ site_of_resection_or_biopsy <fct> C34.1, C34.1, C34.3, C34.1, C34.1, C34.3, …
$ days_to_last_follow_up      <dbl> NA, NA, 2099, 3747, 3576, NA, NA, 1810, 95…
$ cigarettes_per_day          <dbl> 10.9589041, 2.1917808, 1.6438356, 1.095890…
$ years_smoked                <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 26, NA…
$ gender                      <fct> male, male, female, male, female, male, ma…
$ year_of_birth               <dbl> 1936, 1931, 1927, 1930, 1942, 1953, 1932, …
$ race                        <fct> white, asian, white, white, not reported, …
$ ethnicity                   <fct> not hispanic or latino, not hispanic or la…
$ year_of_death               <dbl> 2004, 2003, NA, NA, NA, 2005, 2006, NA, NA…
$ bcr_patient_barcode         <fct> TCGA-18-3406, TCGA-18-3407, TCGA-18-3408, …
$ disease                     <fct> LUSC, LUSC, LUSC, LUSC, LUSC, LUSC, LUSC, …
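
To see what changes when a character vector becomes a factor, here is a small base R sketch on a made-up vector (stage_chr is invented for illustration). A factor stores the distinct values once as levels and then records, for each element, which level it belongs to.

```r
# Made-up values for illustration only
stage_chr <- c("stage ib", "stage ia", "stage ib", "stage iia")

stage_fct <- as.factor(stage_chr)

is.character(stage_chr)   # the original is a character vector
is.factor(stage_fct)      # the converted version is a factor

levels(stage_fct)         # distinct values, sorted alphabetically by default
nlevels(stage_fct)        # number of distinct values
as.integer(stage_fct)     # integer codes that point at the levels
```

Note that the level order comes from sorting the values, not from the order in which they appear in the data.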

6.6 Challenge 2 (~10 minutes)

data(penguins) # from the palmerpenguins package
glimpse(penguins)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
  1. Convert all variables in the palmerpenguins data set that end with mm to cm by dividing by 10
penguins %>%
    mutate(
        across(.cols = ----, 
               .fns = ~ -----)) %>%
    glimpse()
  2. Convert all factor variables to numeric:
penguins %>%
  mutate(
    across(.cols = ----, 
           .fns = ~ -----)
  ) %>%
  glimpse()
  3. Convert all factor vectors to character, then convert their values to all caps:
penguins %>%
  mutate(
    
  )

7 summarize() data

The summarize() verb produces (surprise!) summaries of your data, and outputs a tibble with this information.

7.1 Basic summarize() example

  • Find the average (mean) days_to_last_follow_up:
smoke_new %>%
  summarize(days_to_last_follow_up = mean(days_to_last_follow_up))
# A tibble: 1 × 1
  days_to_last_follow_up
                   <dbl>
1                     NA
  • I forgot an important argument to my mean function!
smoke_new %>%
    summarize(days_to_last_follow_up = 
                mean(days_to_last_follow_up, na.rm = TRUE)) 
# A tibble: 1 × 1
  days_to_last_follow_up
                   <dbl>
1                   945.
  • Compare to the base R way:
mean(smoke_new$days_to_last_follow_up, na.rm = TRUE)
[1] 944.8547

7.2 What are the kinds of things that summarize is useful for?

Useful functions (from https://dplyr.tidyverse.org/reference/summarise.html)

  • Center: mean(), median()
  • Spread: sd() (standard deviation)
  • Range: min(), max()
  • Position: first(), last(), nth()
  • Count: n(), n_distinct()
  • Percentiles: quantile()
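
Several of these functions can be combined in a single summarize() call. The sketch below uses a made-up tibble (toy and its values are invented for illustration) and assumes dplyr is loaded.

```r
library(dplyr)

# Made-up data for illustration only
toy <- tibble(
  id    = 1:6,
  value = c(4, 8, NA, 15, 16, 23)
)

toy_summary <- toy %>%
  summarize(
    n_rows       = n(),                         # count of rows
    n_missing    = sum(is.na(value)),           # count of missing values
    mean_value   = mean(value, na.rm = TRUE),   # center
    median_value = median(value, na.rm = TRUE),
    sd_value     = sd(value, na.rm = TRUE),     # spread
    min_value    = min(value, na.rm = TRUE),    # range
    max_value    = max(value, na.rm = TRUE),
    q25          = quantile(value, probs = 0.25, na.rm = TRUE),  # percentile
    first_id     = first(id)                    # position
  )

toy_summary
```

Each argument of summarize() becomes one column of the one-row result.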

By itself, summarize() is nice. But it’s really when combined with group_by() that it becomes extremely powerful.

7.3 group_by() and summarize()

  • These two verbs usually go together.
    • group_by() by itself doesn’t change the data values; it only records the groups.
  • What group_by() does is split our dataset
    • into a number of smaller datasets split out by category
    • (shown as different colors in figure below)
  • Then we use summarize to do a summary calculation (such as counting or calculating the mean) on the smaller datasets
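
The split-then-summarize idea can be sketched on a made-up tibble (toy is hypothetical, not the course data). It assumes dplyr is loaded and uses group_vars() to inspect the recorded grouping.

```r
library(dplyr)

# Made-up data for illustration only
toy <- tibble(
  disease = c("A", "A", "B", "B", "B"),
  age     = c(50, 60, 30, 40, 50)
)

grouped <- toy %>% group_by(disease)

nrow(grouped)         # same number of rows as toy: nothing was removed
group_vars(grouped)   # but the grouping variable is now recorded

# summarize() then computes one row per group
by_disease <- grouped %>%
  summarize(mean_age = mean(age))

by_disease
```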

7.4 Example: group_by() and summarize()

  • For example, if we want to compare the mean age_at_diagnosis for each disease type, we’d do the following:
smoke_new %>%
  group_by(disease) %>%
  summarize(average_age = mean(age_at_diagnosis, na.rm = TRUE))
# A tibble: 3 × 2
  disease average_age
  <chr>         <dbl>
1 BLCA         24931.
2 CESC         16973.
3 LUSC         24765.
  • When we group_by() a variable, it changes an attribute/property of our data, which we can see in glimpse():
smoke_new %>%
  group_by(disease) %>%
  glimpse()
Rows: 1,152
Columns: 27
Groups: disease [3]
$ primary_diagnosis           <chr> "C34.1", "C34.1", "C34.3", "C34.1", "C34.1…
$ tumor_stage                 <chr> "stage ia", "stage ib", "stage ib", "stage…
$ age_at_diagnosis            <dbl> 24477, 26615, 28171, 27154, 23370, 19025, …
$ vital_status                <fct> dead, dead, dead, alive, alive, dead, dead…
$ morphology                  <chr> "8070/3", "8070/3", "8070/3", "8083/3", "8…
$ days_to_death               <dbl> 371, 136, 2304, NA, NA, 345, 716, 2803, 97…
$ state                       <chr> "live", "live", "live", "live", "live", "l…
$ tissue_or_organ_of_origin   <chr> "C34.1", "C34.1", "C34.3", "C34.1", "C34.1…
$ days_to_birth               <dbl> -24477, -26615, -28171, -27154, -23370, -1…
$ site_of_resection_or_biopsy <chr> "C34.1", "C34.1", "C34.3", "C34.1", "C34.1…
$ days_to_last_follow_up      <dbl> NA, NA, 2099, 3747, 3576, NA, NA, 1810, 95…
$ cigarettes_per_day          <dbl> 10.9589041, 2.1917808, 1.6438356, 1.095890…
$ years_smoked                <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 26, NA…
$ gender                      <fct> male, male, female, male, female, male, ma…
$ year_of_birth               <dbl> 1936, 1931, 1927, 1930, 1942, 1953, 1932, …
$ race                        <chr> "white", "asian", "white", "white", "not r…
$ ethnicity                   <chr> "not hispanic or latino", "not hispanic or…
$ year_of_death               <dbl> 2004, 2003, NA, NA, NA, 2005, 2006, NA, NA…
$ bcr_patient_barcode         <chr> "TCGA-18-3406", "TCGA-18-3407", "TCGA-18-3…
$ disease                     <chr> "LUSC", "LUSC", "LUSC", "LUSC", "LUSC", "L…
$ alive                       <lgl> FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FA…
$ alive2                      <dbl> 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, …
$ cigarettes_total            <dbl> 268241.10, 58334.25, 46308.49, 29757.81, 6…
$ cigarettes_high             <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, …
$ disease_fac                 <fct> LUSC, LUSC, LUSC, LUSC, LUSC, LUSC, LUSC, …
$ days_to_last_follow_up2     <dbl> 0, 0, 2099, 3747, 3576, 0, 0, 1810, 956, 7…
$ stage_category              <chr> "i", "i", "i", "i", "other", "i", "other",…

7.5 Example: calculate means and standard deviations with group_by() and summarize()

  • We can have multiple summaries within one summarize() function.
  • Calculate both the mean age at diagnosis and the standard deviation, for each disease type.
smoke_new %>%
  group_by(disease) %>%
  summarize(mean_age = mean(age_at_diagnosis, na.rm = TRUE),
            sd_age = sd(age_at_diagnosis, na.rm = TRUE))
# A tibble: 3 × 3
  disease mean_age sd_age
  <chr>      <dbl>  <dbl>
1 BLCA      24931.  3571.
2 CESC      16973.  4717.
3 LUSC      24765.  3033.

7.6 group_by multiple variables and summarize()

  • Calculate
    • both the mean age at diagnosis and the standard deviation,
    • for each disease type and gender
  • Add gender as a grouping variable:
smoke_new %>%
  group_by(disease, gender) %>%
  summarize(mean_age = mean(age_at_diagnosis, na.rm = TRUE),
            sd_age = sd(age_at_diagnosis, na.rm = TRUE))
# A tibble: 5 × 4
# Groups:   disease [3]
  disease gender mean_age sd_age
  <chr>   <fct>     <dbl>  <dbl>
1 BLCA    female   24873.  3821.
2 BLCA    male     24950.  3499.
3 CESC    female   16973.  4717.
4 LUSC    female   24900.  3117.
5 LUSC    male     24717.  3004.
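
Notice the "# Groups: disease [3]" line in the output: after summarizing with two grouping variables, the result is still grouped by the first one. If that is not wanted, one option is the .groups argument of summarize(). The sketch below uses a made-up tibble (toy is hypothetical) and assumes dplyr is loaded.

```r
library(dplyr)

# Made-up data for illustration only
toy <- tibble(
  disease = c("A", "A", "B", "B"),
  gender  = c("f", "m", "f", "f"),
  age     = c(50, 60, 30, 40)
)

# default: only the last grouping variable (gender) is dropped
still_grouped <- toy %>%
  group_by(disease, gender) %>%
  summarize(mean_age = mean(age))

# .groups = "drop" returns an ungrouped tibble
ungrouped <- toy %>%
  group_by(disease, gender) %>%
  summarize(mean_age = mean(age), .groups = "drop")

group_vars(still_grouped)
group_vars(ungrouped)
```

Calling ungroup() after summarize() is another way to get the same ungrouped result.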

7.7 count() frequencies with group_by()

  • What if we want to know the number of people in our dataset within each disease and gender combination?
  • We can use the n() function.
smoke_new %>%
  group_by(disease, gender) %>%
    summarize(count = n())
# A tibble: 5 × 3
# Groups:   disease [3]
  disease gender count
  <chr>   <fct>  <int>
1 BLCA    female    54
2 BLCA    male     170
3 CESC    female    92
4 LUSC    female   220
5 LUSC    male     616
  • We can also use the count() function directly.
  • However, n() is still worth knowing, because it can be combined with other summary functions in one summarize() statement.
smoke_new %>%
    count(disease, gender)
# A tibble: 5 × 3
  disease gender     n
  <chr>   <fct>  <int>
1 BLCA    female    54
2 BLCA    male     170
3 CESC    female    92
4 LUSC    female   220
5 LUSC    male     616
  • Note: we can also create a two-way table to get these frequencies.
    • However, it’s sometimes handy to have the counts in a tibble to use for something else (such as creating figures).
smoke_new %>%
    tabyl(disease, gender)
 disease female male
    BLCA     54  170
    CESC     92    0
    LUSC    220  616
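
As a sketch of why n() is useful alongside count(), the example below computes group sizes both ways on a made-up tibble (toy is hypothetical) and then adds further summaries in the same summarize() call. It assumes dplyr is loaded.

```r
library(dplyr)

# Made-up data for illustration only
toy <- tibble(
  disease = c("A", "A", "A", "B", "B"),
  age     = c(50, 60, NA, 30, 40)
)

# count() gives the group sizes directly
via_count <- toy %>%
  count(disease)

# n() gives the same sizes, next to other summaries
via_summarize <- toy %>%
  group_by(disease) %>%
  summarize(
    n          = n(),                      # rows per group
    n_with_age = sum(!is.na(age)),         # rows with a non-missing age
    mean_age   = mean(age, na.rm = TRUE)
  )

via_count
via_summarize
```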

7.8 summarize() with across()

  • In the same way that we used across() to mutate() several columns,
    • we can use across() to summarize several columns.
  • Calculate the mean of all the numeric variables.
  • Remember that when we need to pass extra arguments to a function like mean() in across(),
    • we have to use the ~ function(.x, arguments) syntax
    • like this example with na.rm = TRUE:
smoke_new %>%
  group_by(disease) %>%
  summarize(across(.cols = where(is.numeric), 
                   .fns = ~ mean(.x, na.rm = TRUE)  
                   ))
# A tibble: 3 × 13
  disease age_at_diagnosis days_to_death days_to_birth days_to_last_follow_up
  <chr>              <dbl>         <dbl>         <dbl>                  <dbl>
1 BLCA              24931.          576.       -24931.                   778.
2 CESC              16973.         1100.       -16973.                  1225.
3 LUSC              24765.          911.       -24765.                   957.
# ℹ 8 more variables: cigarettes_per_day <dbl>, years_smoked <dbl>,
#   year_of_birth <dbl>, year_of_death <dbl>, alive2 <dbl>,
#   cigarettes_total <dbl>, cigarettes_high <dbl>,
#   days_to_last_follow_up2 <dbl>
  • Count the number of distinct values for each character vector:
smoke_new %>%
  group_by(disease) %>%
  summarise(across(where(is.character), 
                   n_distinct))   # no function arguments needed, so can use simple way
# A tibble: 3 × 11
  disease primary_diagnosis tumor_stage morphology state tissue_or_organ_of_or…¹
  <chr>               <int>       <int>      <int> <int>                   <int>
1 BLCA                    7           5          2     1                       7
2 CESC                    3           1          9     1                       3
3 LUSC                    7          11          6     1                       6
# ℹ abbreviated name: ¹​tissue_or_organ_of_origin
# ℹ 5 more variables: site_of_resection_or_biopsy <int>, race <int>,
#   ethnicity <int>, bcr_patient_barcode <int>, stage_category <int>
  • Note that “.cols =” and “.fns =” can be omitted

7.9 Side note about gt::gt()

  • The gt package can make our tables look much nicer in html output.
  • We will talk about customizing the gt() output later in the quarter,
    • but for now, the default output is pretty nice
# before gt
smoke_new %>%
  group_by(disease) %>%
  summarize(across(.cols = starts_with("day"), 
                   .fns = ~ mean(.x, na.rm = TRUE)
                   ))
# A tibble: 3 × 5
  disease days_to_death days_to_birth days_to_last_follow_up
  <chr>           <dbl>         <dbl>                  <dbl>
1 BLCA             576.       -24931.                   778.
2 CESC            1100.       -16973.                  1225.
3 LUSC             911.       -24765.                   957.
# ℹ 1 more variable: days_to_last_follow_up2 <dbl>
# after gt
smoke_new %>%
  group_by(disease) %>%
  summarize(across(.cols = starts_with("day"), 
                   .fns = ~ mean(.x, na.rm = TRUE)
                   )) %>%
  gt()               # makes the table look nice in html
disease days_to_death days_to_birth days_to_last_follow_up days_to_last_follow_up2
BLCA         575.7059     -24931.12               778.0870                559.2500
CESC        1099.5769     -16973.26              1225.3676                905.7065
LUSC         910.8984     -24765.46               956.7791                746.1962
# note that the gt package got installed & loaded in the setup code chunk

7.10 summarize() multiple functions at one time

  • We can summarize the same function across multiple columns,
    • but we can also summarize with multiple functions using across!
  • Note that we specify a list of functions.
    • We will talk a lot more about lists in a couple weeks.
smoke_new %>%
  group_by(disease) %>%
  summarize(across(c(age_at_diagnosis, days_to_death),
                   .fns = list(
                     min = ~ min(.x, na.rm = TRUE),
                     mean = ~ mean(.x, na.rm = TRUE),
                     max = ~ max(.x, na.rm = TRUE))
                   )) %>%
  gt()
disease age_at_diagnosis_min age_at_diagnosis_mean age_at_diagnosis_max days_to_death_min days_to_death_mean days_to_death_max
BLCA                   13867              24931.12                31755                56           575.7059              3183
CESC                    7855              16973.26                31258               132          1099.5769              4086
LUSC                   14643              24765.46                32872                 1           910.8984              5287
  • Remember, you can always filter out missing values.
    • Think about this: what population are you summarizing now?
smoke_new %>%
  filter(!is.na(days_to_death)) %>%
  group_by(disease) %>%
  summarize(across(c(age_at_diagnosis, days_to_death),
                   .fns = list(
                     min = ~ min(.x, na.rm = TRUE),
                     mean = ~ mean(.x, na.rm = TRUE),
                     max = ~ max(.x, na.rm = TRUE))
                   )) %>%
  gt()
disease age_at_diagnosis_min age_at_diagnosis_mean age_at_diagnosis_max days_to_death_min days_to_death_mean days_to_death_max
BLCA                   17501              25539.54                31442                56           575.7059              3183
CESC                    7855              17596.08                28873               132          1099.5769              4086
LUSC                   14643              25306.56                32872                 1           910.8984              5287
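
The output columns above are named column_function (for example age_at_diagnosis_min). As a sketch of how that naming can be changed, the example below passes a .names= specification with the {.fn} and {.col} placeholders. The tibble toy is made up for illustration, and dplyr is assumed to be loaded.

```r
library(dplyr)

# Made-up data for illustration only
toy <- tibble(
  grp = c("a", "a", "b", "b"),
  x   = c(1, 3, 5, NA),
  y   = c(10, 20, 30, 40)
)

toy_summary <- toy %>%
  group_by(grp) %>%
  summarize(across(c(x, y),
                   .fns = list(
                     mean = ~ mean(.x, na.rm = TRUE),
                     max  = ~ max(.x, na.rm = TRUE)),
                   # {.fn} is the name given in the list, {.col} the column name
                   .names = "{.fn}_of_{.col}"))

names(toy_summary)
```

The names in the list (mean, max) are what {.fn} expands to, so giving the list elements clear names matters.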

7.11 Challenge 3 (10 minutes)

  1. Use group_by() with summarize() to calculate the minimum, maximum, and mean cigarettes_per_day by disease and vital status. Try doing this with and without across().

  2. Use across() with summarize() to calculate the number of distinct values of all columns that contain the word “days”, within each group defined by alive.

  3. Create a boxplot of days to death with disease on the x axis, and both the y axis and fill mapped to cigarettes_high. (Careful: You may need to change cigarettes_high somehow for this to work!)

8 More ggplot2 fun (Optional, if time)

8.1 geoms are layers on your plot

  • The really neat thing about geoms is that they can be layered onto your plot.
    • Below we’re layering another geom, geom_violin() on geom_boxplot().
  • We’re providing a couple of values to geom_violin():
    • alpha, which controls the transparency of the geom
    • width, which controls the overall width of the violins.
our_boxplot <- ggplot(smoke_new) +
  aes(y = cigarettes_per_day, 
      x = disease, 
      fill = disease) +
  geom_boxplot() +
  geom_violin(alpha = 0.2, width = 0.5) +
  ylim(0, 20)      # change the y limits here to make it easier to see

our_boxplot

8.2 Scales

  • In part 3 we learned how to change color and fill “scales” or palettes in ggplots.
  • There are many other types of scales.
  • Think of a scale as a layer added to a ggplot.

Scales are added to a ggplot for multiple reasons:

  • Change color palettes of continuous or discrete values
  • Map discrete values to particular colors (make “LUSC” = “blue”)
  • Transform numerical axes (such as change a plot to a log scale)
  • Change the breaks (tick values) in a plot
  • Format values in a scale (such as dollars: “$500”, “$600”)

All scale functions begin with scale_, followed by the aesthetic they work on and then the type or transformation of the scale.
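
To make the naming pattern concrete, here is a minimal sketch with three scale functions on a made-up data frame (toy is hypothetical). Each function name reads as scale, then the aesthetic (x, y, color), then the kind of scale. It assumes ggplot2 is loaded.

```r
library(ggplot2)

# Made-up data for illustration only
toy <- data.frame(
  x   = c(1, 10, 100, 1000),
  y   = c(2, 4, 6, 8),
  grp = c("a", "b", "a", "b")
)

p <- ggplot(toy) +
  aes(x = x, y = y, color = grp) +
  geom_point() +
  scale_x_log10() +                                  # x aesthetic, log10 transformation
  scale_y_continuous(breaks = c(2, 4, 6, 8)) +       # y aesthetic, continuous, custom breaks
  scale_color_manual(values = c("a" = "blue",        # color aesthetic, manual mapping
                                "b" = "orange"))

p
```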

8.3 Changing axis ticks and values using scale_x_continuous()

One use for scale_ functions is to modify the breaks (the tick values along an axis).

  • Below we’re including the scale_x_continuous() argument called breaks,
    • which is a numeric vector that specifies the values we want each tick to be at:
# original figure
our_plot <- ggplot(smoke_complete) +
  aes(x = age_at_diagnosis, 
      y = cigarettes_per_day, 
      color = disease) +
  geom_point(alpha = 0.5) +
  labs(title = "Cigarettes per Day versus Age at Diagnosis",
       x = "Age at Diagnosis",
       y = "Cigarettes Smoked per Day")

our_plot

# modify tick values on x-axis
our_plot <- our_plot + 
  scale_x_continuous(breaks = c(10000, 20000, 30000))

our_plot

8.4 Mini-challenge 3

  • How would you modify the y-axis to have the following breaks?

c(2, 4, 6, 8, 10)

Hint: it’s not scale_x_continuous

our_plot +

8.5 Better Labels

Another use for scale_ functions is to improve the formatting on the axis tick labels.

  • Suppose our y-axis was dollars instead of number of cigarettes.
  • We can supply label_currency() (from the scales package) as the value of the labels argument in scale_y_continuous():
our_plot +
    scale_y_continuous(labels = scales::label_currency())

Other useful label_ functions for formatting labels:

  • label_date() - lets you control the way dates are formatted on the axis
  • label_scientific() - lets you format the labels with scientific notation
  • label_pvalue() - formatting specific to p-values
  • label_percent() - percent formatting
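
It can help to know that each label_ function returns another function, which is what the labels argument expects. The sketch below calls two of them directly on made-up numbers to preview the formatted text. It assumes a version of the scales package that includes label_currency(), as used above; the names fmt_percent and fmt_dollar are made up for illustration.

```r
library(scales)

# label_percent() returns a formatting function
fmt_percent <- label_percent()
fmt_percent(c(0.25, 0.5))

# label_currency() does the same for dollar amounts
fmt_dollar <- label_currency()
fmt_dollar(c(500, 600))
```

Trying a label function on a few values like this is a quick way to check the formatting before adding it to a plot.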

8.6 scale_color_manual() - mapping discrete colors to categories

  • Last class we used scale_* functions for fill and color to change color palettes.

  • The scale_color_manual() function can be used to explicitly set colors by name, or using a built in palette.

  • If you want to explicitly choose the color that each value in your data set is mapped to, you can use the values argument of scale_color_manual()

our_plot +
  scale_color_manual(
    values = c(
      "LUSC" = "limegreen",
      "BLCA" = "orange",
      "CESC" = "pink"
    ),
    name = "Disease Category" # optional argument, a way to name the legend
  )

8.7 Color Palettes

  • If you don’t want to meticulously control every color, you can use palettes from various packages.
  • Many palettes are listed in this GitHub repository by Emil Hvitfeldt.

Here’s one fun one: {ghibli} https://ewenme.github.io/ghibli/

# library(ghibli)

names(ghibli_palettes)
 [1] "MarnieLight1"    "MarnieMedium1"   "MarnieDark1"     "MarnieLight2"   
 [5] "MarnieMedium2"   "MarnieDark2"     "PonyoLight"      "PonyoMedium"    
 [9] "PonyoDark"       "LaputaLight"     "LaputaMedium"    "LaputaDark"     
[13] "MononokeLight"   "MononokeMedium"  "MononokeDark"    "SpiritedLight"  
[17] "SpiritedMedium"  "SpiritedDark"    "YesterdayLight"  "YesterdayMedium"
[21] "YesterdayDark"   "KikiLight"       "KikiMedium"      "KikiDark"       
[25] "TotoroLight"     "TotoroMedium"    "TotoroDark"     
ghibli_palettes[1:3] # first three palettes
$MarnieLight1
<colors>
#95918EFF #AF9699FF #80C7C9FF #8EBBD2FF #E3D1C3FF #B3DDEBFF #F3E8CCFF 

$MarnieMedium1
<colors>
#28231DFF #5E2D30FF #008E90FF #1C77A3FF #C5A387FF #67B8D6FF #E9D097FF 

$MarnieDark1
<colors>
#15110EFF #2F1619FF #004749FF #0E3B52FF #635143FF #335D6BFF #73684CFF 
ghibli_palette("PonyoMedium")

our_plot +
    scale_color_manual(
      values = ghibli_palette("PonyoMedium"))

  • Remember, if we want to do this with our_boxplot,
    • the scale we want to manipulate is not color, but fill,
    • so we need to use scale_fill_manual()
our_boxplot +
    scale_fill_manual(values = ghibli_palette("MononokeMedium"))

8.8 Mini-challenge 4

Pick one of the palettes from ghibli and use it to modify our_plot (hint: use ?ghibli_palette):

our_plot +

8.9 scale_color_gradient()

We’ve talked about variables with discrete values, but what about variables with continuous values?

ggplot(smoke_complete) +
  aes(x = age_at_diagnosis, 
      y = cigarettes_per_day, 
      color = year_of_birth) +
  geom_point(alpha = 0.5) +
  labs(title = "Cigarettes per Day versus Age at Diagnosis",
       x = "Age at Diagnosis",
       y = "Cigarettes Smoked per Day",
       color = "Year of Birth") +
  scale_color_gradient(low = "pink", high="blue")

8.10 Color Palettes: Viridis

A built-in palette in ggplot2 is based on the colorblind-friendly palettes in the viridis package (we don’t need the package to use these scales as newer versions of ggplot2 import them).

We have built-in functions that use these:

  • For discrete values:
    • scale_color_viridis_d()
    • scale_fill_viridis_d()
  • For continuous values:
    • scale_color_viridis_c()
    • scale_fill_viridis_c()
our_plot +
  theme_minimal() +
  scale_color_viridis_d()

  • Several options are available, including “magma” (or “A”), “inferno” (or “B”), “plasma” (or “C”), “viridis” (or “D”, the default option) and “cividis” (or “E”).
our_plot +
  theme_minimal() +
  scale_color_viridis_d(option = "A")
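
The examples above use the discrete (_d) version because disease is categorical. For a continuous variable mapped to color, the _c version applies. Here is a minimal sketch on a made-up data frame (toy and its columns are invented for illustration), assuming ggplot2 is loaded.

```r
library(ggplot2)

# Made-up data for illustration only
toy <- data.frame(
  x = 1:10,
  y = (1:10)^2,
  z = seq(1930, 1975, by = 5)   # a continuous variable, e.g. a year
)

p <- ggplot(toy) +
  aes(x = x, y = y, color = z) +
  geom_point(size = 3) +
  theme_minimal() +
  scale_color_viridis_c(option = "E")   # continuous viridis scale, "cividis" option

p
```

Using a _d scale with a continuous variable (or a _c scale with a discrete one) gives an error, so the suffix has to match the variable type.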

8.11 More Palettes

  • The {paletteer} package has a lot of other palettes from many packages included all in one:
our_plot + 
  theme_minimal() +
  paletteer::scale_color_paletteer_d(palette = "impressionist.colors::irissen")

our_plot + 
  theme_minimal() +
  paletteer::scale_color_paletteer_d("ggthemes::Tableau_10")

# View(palettes_d_names) for all palettes
  • Remember we can save this (by default, the last plot created):
ggsave("part4/output_figs/gg_myplot.jpg")
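
Relying on "the last plot created" can be fragile in a long notebook. A sketch of a more explicit call is shown below: it names the plot object and sets the size and resolution. The plot, the file name gg_toy_plot.png, and the use of a temporary folder are all made up for illustration; in practice you would use your own plot and output path.

```r
library(ggplot2)

# Made-up plot for illustration only
p <- ggplot(data.frame(x = 1:3, y = c(2, 4, 8))) +
  aes(x = x, y = y) +
  geom_point()

# hypothetical output location in a temporary folder
out_file <- file.path(tempdir(), "gg_toy_plot.png")

ggsave(filename = out_file,
       plot = p,          # save this plot, not whichever was drawn last
       width = 6,
       height = 4,
       units = "in",
       dpi = 300)

file.exists(out_file)
```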

9 Practice

Try out chapters 2 and 3 in the R-Bootcamp:


If you are interested in learning more about ggplot:

10 Post Class Survey

Please fill out the post-class survey.

Your responses are anonymous in that I separate your names from the survey answers before compiling/reading.

You may want to review previous years’ feedback here.

11 Acknowledgements

  • Part 4 is based on the BSTA 504 Winter 2023 course, taught by Jessica Minnier.
    • I made modifications to update the material from RMarkdown to Quarto, and streamlined/edited content for slides.
    • I also made updates for new R functions/specifications.
  • Minnier’s Acknowledgements:
    • Written by Aaron Coyner and Ted Laderas and Jessica Minnier.
    • Based on the Intro to R materials from fredhutch.io and the R-Bootcamp